Descriptive statistics
summarise variable(s)
The main function for calculating summaries of variables is summarise. Examples of descriptive functions are mean, median, sum, etc. Each of these functions takes a vector and returns a single value. summarise takes a tibble together with a specification of the descriptives and produces a single row.
For example, let’s say we want to know the mean height and weight of all individuals in the pulse dataset:
pulse %>% summarise(meanHeight=mean(height), meanWeight=mean(weight))
# A tibble: 1 × 2
meanHeight meanWeight
<dbl> <dbl>
1 172. 66.3
The result is a single row with two variables, meanHeight and meanWeight, holding the corresponding mean values over all observations.
We can also summarise a variable's range, e.g. for age:
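As a minimal sketch of how summarise combines several descriptive functions, the example below uses a small made-up tibble named toy (a stand-in, not the pulse dataset; all values are invented for illustration):

```r
library(dplyr)

# Hypothetical toy data standing in for the pulse dataset (values are made up)
toy <- tibble(
  height = c(160, 170, 180),
  weight = c(50, 60, 70)
)

# Each descriptive function reduces a whole column (a vector) to a single value
result <- toy %>%
  summarise(
    meanHeight   = mean(height),
    medianWeight = median(weight),
    sumWeight    = sum(weight)
  )

result  # one row, one column per requested descriptive
```

Whatever the number of rows in the input, the result here has exactly one row, because every function used returns a single value.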
pulse %>% summarise(minAge = min(age), maxAge=max(age)) # <=> range(pulse$age)
# A tibble: 1 × 2
minAge maxAge
<dbl> <dbl>
1 18 45
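The comment in the call above refers to base R's range function. A small sketch of that correspondence, using an invented tibble toy rather than the pulse data:

```r
library(dplyr)

# Made-up ages, for illustration only
toy <- tibble(age = c(23, 18, 45, 30))

# Two separate descriptives, each returning a single value
by_summarise <- toy %>% summarise(minAge = min(age), maxAge = max(age))

# range() returns both values at once as a vector of length two: c(min, max)
by_range <- range(toy$age)

by_range[1] == by_summarise$minAge  # same minimum
by_range[2] == by_summarise$maxAge  # same maximum
```

Since range returns two values rather than one, min and max are used separately here so that each summary column holds a single value.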
n() is a convenient function that returns the total number of rows in the summarise context:
pulse %>% summarise( count = n(), meanHeight = mean( height ) )
# A tibble: 1 × 2
count meanHeight
<int> <dbl>
1 110 172.
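To illustrate that n() simply reports the number of rows summarise is working on, here is a short sketch on an invented tibble toy (not the pulse dataset):

```r
library(dplyr)

# Made-up stand-in data, not the pulse dataset
toy <- tibble(height = c(160, 170, 180, 190))

# n() takes no arguments; it is meant to be called inside verbs such as summarise
result <- toy %>% summarise(count = n(), meanHeight = mean(height))

result$count == nrow(toy)  # the count matches the number of rows of the input
```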
count: frequency tables
With the count function we can count the frequency of values in a categorical variable:
pulse %>% count(gender) # frequency of male/female
# A tibble: 2 × 2
gender n
<chr> <int>
1 female 51
2 male 59
pulse %>% count(smokes) # frequency of smoking habit
# A tibble: 2 × 2
smokes n
<chr> <int>
1 no 99
2 yes 11
pulse %>% count(exercise) # frequency of exercise habit
# A tibble: 3 × 2
exercise n
<chr> <int>
1 high 14
2 low 37
3 moderate 59
The result enumerates the distinct values of the variable in the first column and their frequency in a new column n.
Multiple variables are allowed; the result is then the count of each combination of values that occurs in the data, also known as a contingency table or cross table:
pulse %>% count(gender, exercise)
# A tibble: 6 × 3
gender exercise n
<chr> <chr> <int>
1 female high 3
2 female low 20
3 female moderate 28
4 male high 11
5 male low 17
6 male moderate 31
pulse %>% count(year, gender)
# A tibble: 10 × 3
year gender n
<dbl> <chr> <int>
1 1993 female 12
2 1993 male 14
3 1995 female 11
4 1995 male 11
5 1996 female 10
6 1996 male 11
7 1997 female 8
8 1997 male 15
9 1998 female 10
10 1998 male 8
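The following sketch shows two properties of count with several variables on a tiny invented tibble toy (values are made up): only combinations that actually occur get a row, and the frequencies add up to the total number of rows.

```r
library(dplyr)

# Made-up data: note that the combination female/high never occurs
toy <- tibble(
  gender   = c("female", "male", "male", "female", "male"),
  exercise = c("low", "high", "low", "low", "low")
)

counted <- toy %>% count(gender, exercise)

counted                      # 3 rows: female/low, male/high, male/low
sum(counted$n) == nrow(toy)  # every observation is counted exactly once
```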
distinct values in variables
To identify the distinct values in a variable or a group of variables we use the function distinct:
pulse %>% distinct(year)
# A tibble: 5 × 1
year
<dbl>
1 1993
2 1995
3 1996
4 1997
5 1998
pulse %>% distinct(exercise)
# A tibble: 3 × 1
exercise
<chr>
1 moderate
2 high
3 low
pulse %>% distinct(ran)
# A tibble: 2 × 1
ran
<chr>
1 sat
2 ran
Again, multiple variables are allowed. To identify distinct combinations of gender and exercise:
pulse %>% distinct(gender, exercise)
# A tibble: 6 × 2
gender exercise
<chr> <chr>
1 female moderate
2 female high
3 male high
4 female low
5 male low
6 male moderate
distinct produces the same combinations of values as the count function, but without the frequency column n; the rows may also appear in a different order.
You may also use distinct to check whether a variable has a unique value for each observation. For example, let's check whether all individuals in the pulse dataset have different names, i.e. whether each observation is uniquely identifiable by the variable name:
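A small side-by-side sketch of distinct and count on an invented tibble toy (made-up values, not the pulse data) makes the relationship concrete:

```r
library(dplyr)

# Made-up stand-in data
toy <- tibble(
  gender   = c("male", "female", "male", "female"),
  exercise = c("low", "high", "low", "low")
)

d <- toy %>% distinct(gender, exercise)  # combinations in order of first appearance
k <- toy %>% count(gender, exercise)     # same combinations, sorted, plus column n

# Compare the combinations as sets, ignoring row order
setequal(paste(d$gender, d$exercise), paste(k$gender, k$exercise))
```

In this toy example distinct lists male/low first because it appears first in the data, whereas count starts with female/high because its output is sorted by the counted variables.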
pulse %>% nrow() # total number of rows
[1] 110
pulse %>% distinct(name) %>% nrow() # count the number of distinct names
[1] 106
There are 106 distinct names but 110 observations in total in the pulse dataset. This can only mean that some individuals in the pulse dataset share a name:
nrow(pulse) == nrow( pulse %>% distinct(name)) # is 'name' unique for all observations?
[1] FALSE
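Once the check fails, a natural follow-up is to see which values are shared. The sketch below does this on an invented tibble toy (the names are made up); it relies on the sort argument of count, which orders the result by descending frequency:

```r
library(dplyr)

# Made-up data in which one name occurs twice
toy <- tibble(name = c("Bobby", "Ann", "Bobby", "Cara"))

# Uniqueness check, as in the text above
is_unique <- nrow(toy) == nrow(toy %>% distinct(name))

# Frequency of each name, most frequent first
name_freq <- toy %>% count(name, sort = TRUE)

is_unique   # FALSE, because "Bobby" appears twice
name_freq   # names with n greater than 1 are the shared ones
```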
arrange
You may sort rows according to one or more variables with the function arrange.
Try sorting the pulse dataset by name:
pulse %>% arrange(name) # sorts the rows by name in dictionary order
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1996_C Adel… 157 41 20 female no no modera… ran 70 95
2 1996_P Adri… 180 102 20 male no yes modera… sat 76 72
3 1997_O Albe… 194 110 25 male no no modera… sat 75 75
4 1993_V Arle… 140 50 34 female no no low ran 70 98
5 1998_O Bett… 161 43 19 female no no low sat 90 89
6 1995_F Bobby 180 85 19 male yes yes modera… ran 68 125
7 1995_L Bobby 169 68 19 male no no modera… sat 58 58
8 1993_A Bonn… 173 57 18 female no yes modera… sat 86 88
9 1996_F Bran… 171 67 18 female no yes low sat 76 74
10 1996_K Brid… 160 49 19 female no no low sat 80 72
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
or by height:
pulse %>% arrange(height) # numerical order
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1998_J Raul 68 63 19 male no no modera… ran 88 136
2 1998_N Lizz… 93 27 19 female no no low sat 119 120
3 1993_V Arle… 140 50 34 female no no low ran 70 98
4 1997_A Katr… 151 42 22 female no no low ran 85 130
5 1993_T Maura 155 50 19 female no no modera… sat 78 79
6 1995_N Tisha 155 49 18 female no yes modera… sat 104 92
7 1998_G Ursu… 155 55 20 female no yes high sat 82 87
8 1996_C Adel… 157 41 20 female no no modera… ran 70 95
9 1996_J Pene… 158 51 18 female no no modera… ran 68 84
10 1995_G Laur… 160 57 19 female no no modera… ran 75 130
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
By default the data is sorted in ascending order. To sort in descending order, wrap the variable in the desc function:
pulse %>% arrange(desc(name))
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1997_C Will… 190 82 19 male no no modera… sat 76 73
2 1997_F Wesl… 172 53 20 male no no low ran 72 136
3 1998_G Ursu… 155 55 20 female no yes high sat 82 87
4 1993_X Tyro… 182 75 26 male yes yes modera… sat 80 76
5 1993_J Troy 168 60 23 male no yes modera… ran 88 150
6 1993_D Trav… 195 84 18 male no yes high sat 71 73
7 1996_B Trav… 167 70 22 male yes yes low sat 92 84
8 1995_N Tisha 155 49 18 female no yes modera… sat 104 92
9 1997_I Tim 170 58.5 20 male no no low sat 80 82
10 1996_M Tayl… 180 77 18 female no no modera… ran 47 136
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
You may also arrange by multiple variables:
pulse %>% arrange(height,weight)
# A tibble: 110 × 13
id name height weight age gender smokes alcohol exerc…¹ ran pulse1 pulse2
<chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1998_J Raul 68 63 19 male no no modera… ran 88 136
2 1998_N Lizz… 93 27 19 female no no low sat 119 120
3 1993_V Arle… 140 50 34 female no no low ran 70 98
4 1997_A Katr… 151 42 22 female no no low ran 85 130
5 1995_N Tisha 155 49 18 female no yes modera… sat 104 92
6 1993_T Maura 155 50 19 female no no modera… sat 78 79
7 1998_G Ursu… 155 55 20 female no yes high sat 82 87
8 1996_C Adel… 157 41 20 female no no modera… ran 70 95
9 1996_J Pene… 158 51 18 female no no modera… ran 68 84
10 1996_K Brid… 160 49 19 female no no low sat 80 72
# … with 100 more rows, 1 more variable: year <dbl>, and abbreviated variable name
# ¹exercise
Here the data is ordered by height first; rows with the same height (for example the three individuals with height 155) are then ordered by weight.
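The sorting variables can also use different directions. As a minimal sketch on an invented tibble toy (made-up values, not the pulse dataset), the first call breaks ties in height by weight, and the second sorts height in descending order while keeping weight ascending:

```r
library(dplyr)

# Made-up data with three rows sharing the same height
toy <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(155, 155, 160, 155),
  weight = c(50, 49, 57, 55)
)

asc   <- toy %>% arrange(height, weight)        # ties in height ordered by weight
mixed <- toy %>% arrange(desc(height), weight)  # tallest first, ties by weight

asc$name    # "B" "A" "D" "C"
mixed$name  # "C" "B" "A" "D"
```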
Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC